A Online Appendix to: Analysis and Optimization for Boolean Expression Indexing

Authors

  • Mohammad Sadoghi
  • Hans-Arno Jacobsen
Abstract

String tokenization using q-grams maps a string into a high-dimensional vector space model in which the domain of each dimension is binary. The size of this space is exponential in the length of the q-grams. For instance, q-grams of size three over a 26-letter alphabet result in a space with 26^3 dimensions. The vector space representation of a tokenized string (e.g., {‘str’, ‘tri’, ‘rin’, ‘ing’}) can be expressed by setting the dimensions associated with each of its grams (e.g., ‘str’, ‘tri’, ‘rin’, and ‘ing’) to 1 and everything else to 0. Alternatively, we can express this concisely as binary equality predicates, e.g., the predicate on dimension (i.e., attribute) ‘str’ must be equal to 1, while dimensions with value 0 are ignored. Therefore, instead of expressing a tokenized string as a vector of 0s and 1s, we can express it as a set of equality predicates. We can take this one step further by going beyond binary domains to capture more interesting relationships among q-grams and, as a byproduct, reduce the dimensionality of the space. The original model maps each q-gram to a dimension; instead, we can map only the prefix of each q-gram to a dimension and map the rest to a value in that dimension. For example, the q-gram ‘str’ is mapped to the equality predicate [‘st’ = ‘r’]: ‘st’ now represents the predicate’s dimension and ‘r’ is the value in this dimension. In this new mapping, the number of dimensions is reduced from 26^3 to 26^2, and there is a new opportunity to express similarity among overlapping q-grams. For example, the q-grams ‘str’ and ‘ste’ are now both mapped to the dimension represented by ‘st’ (which already signifies a similarity between these two q-grams), but the values ‘r’ and ‘e’ can also play an important role. For instance, since the letters ‘r’ and ‘e’ are adjacent on a standard U.S. keyboard, it is possible that, due to a typing error, the q-gram ‘str’ was entered as ‘ste’. This relationship can be captured by mapping ‘r’ and ‘e’ to spatially close values.
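The two mappings described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names (`qgrams`, `binary_predicates`, `prefix_predicates`) and the `KEY_POSITION` encoding, which places letters by their column on the top row of a U.S. QWERTY keyboard, are assumptions made only for illustration.

```python
from typing import Dict, List, Set, Tuple


def qgrams(s: str, q: int = 3) -> List[str]:
    """Return the overlapping q-grams of s, in order of appearance."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def binary_predicates(s: str, q: int = 3) -> Set[Tuple[str, int]]:
    """Binary model: one dimension per q-gram; keep only dimensions equal to 1."""
    return {(gram, 1) for gram in qgrams(s, q)}


def prefix_predicates(s: str, q: int = 3) -> Set[Tuple[str, str]]:
    """Prefix model: dimension = first q-1 characters, value = last character."""
    return {(gram[:-1], gram[-1]) for gram in qgrams(s, q)}


# Hypothetical value encoding (an assumption, not from the paper): the column
# index of each letter on the top row of a U.S. QWERTY keyboard, so that
# neighbouring keys receive spatially close values.
TOP_ROW = "qwertyuiop"
KEY_POSITION: Dict[str, int] = {ch: i for i, ch in enumerate(TOP_ROW)}

print(qgrams("string"))                     # ['str', 'tri', 'rin', 'ing']
print(sorted(prefix_predicates("string")))  # [('in', 'g'), ('ri', 'n'), ('st', 'r'), ('tr', 'i')]
print(26 ** 3, 26 ** 2)                     # 17576 676
print(abs(KEY_POSITION["r"] - KEY_POSITION["e"]))  # 1
```

Under this encoding the predicates [‘st’ = ‘r’] and [‘st’ = ‘e’] share a dimension and differ by one in value, which is the kind of closeness a typing-error-tolerant match could exploit.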


Similar resources

Basic Goals: Separation of Concerns Generate efficient code sequences for individual operations Keep it fast and simple: leave most optimizations to later phases Provide clean, easy-to-optimize code IR forms the basis for code optimization and target code generation

Assumptions Intermediate language: RISC-like 3-address code‡ Intermediate Code Generation (ICG) is independent of target ISA Storage layout has been pre-determined Infinite number of registers + Frame Pointer (FP) Q. What values can live in registers? ‡ ILOC: Cooper and Torczon, Appendix A. Strategy 1. Simple bottom-up tree-walk on AST 2. Translation uses only local info: current AST node + chi...


A Analysis and Optimization for Boolean Expression Indexing

BE-Tree is a novel dynamic tree data structure designed to efficiently index Boolean expressions over a high-dimensional discrete space. BE-Tree copes with both high-dimensionality and expressiveness of Boolean expressions by introducing a two-phase space-cutting technique that specifically utilizes the discrete and finite domain properties of the space. Furthermore, BE-Tree employs self-adjustm...


Indexing Boolean Expressions

We consider the problem of efficiently indexing Disjunctive Normal Form (DNF) and Conjunctive Normal Form (CNF) Boolean expressions over a high-dimensional multi-valued attribute space. The goal is to rapidly find the set of Boolean expressions that evaluate to true for a given assignment of values to attributes. A solution to this problem has applications in online advertising (where a Boolean...


Propagation Models and Fitting Them for the Boolean Random Sets

In order to study the relationship between random Boolean sets and some explanatory variables, this paper introduces a Propagation model. This model can be applied when corresponding Poisson process of the Boolean model is related to explanatory variables and the random grains are not affected by these variables. An approximation for the likelihood is used to find pseudo-maximum likelihood esti...


Reliability assessment of power distribution systems using disjoint path-set algorithm

Finding the reliability expression of different substation configurations can help design a distribution system with the best overall reliability. This paper presents a computerized and implemented algorithm based on the Disjoint Sum of Product (DSOP) algorithm. The algorithm was synthesized and applied for the first time to the determination of the reliability expression of a substation to determine...



Publication date: 2011